A Similarity-Based Soft Clustering Algorithm for Documents

نویسندگان

King-Ip Lin

Ravikumar Kondadadi

چکیده

Document clustering is an important tool for applications such as Web search engines. Clustering documents enables the user to have a good overall view of the information contained in the documents that he has. However, existing algorithms sufSer from various aspects; hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually ineficient. We propose SISC (Slmilarity-based Soft Clustering), an eficient soft clustering algorithm based on a given similarity measure. SISC requires only a similarity measure for clustering and uses randomization to help make the clustering eficient. Comparison with existing hard clustering algorithms like K-means and its variants shows that SISC is both effective and eficient.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

خوشه‌بندی فراابتکاری اسناد فارسی اِکس‌اِم‌اِل مبتنی بر شباهت ساختاری و محتوایی

Due to the increasing number of documents, XML, effectively organize these documents in order to retrieve useful information from them is essential. A possible solution is performed on the clustering of XML documents in order to discover knowledge. Clustering XML documents is a key issue of how to measure the similarity between XML documents. Conventional clustering of text documents using a do...

متن کامل

A word-based soft clustering algorithm for documents

Document clustering is an important tool for applications such as Web search engines. It enables the user to have a good overall view of the information contained in the documents. However, existing algorithms suffer from various aspects; hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorit...

متن کامل

Algorithms for Soft Document Clustering

Aim of this paper is to highlight the possibilities of clustering algorithms called as the "soft clustering" algorithms. The traditional approach "hard clustering" allows us to include only one document cluster. Soft clustering algorithms, like Fuzzy C-means (FCM), Word Base Soft Clustering (WBSC) Similarity-Based Soft Clustering Algorithm (SISC) and Kondadadi and Kozma modified ART (KMART), al...

متن کامل

An improved opposition-based Crow Search Algorithm for Data Clustering

Data clustering is an ideal way of working with a huge amount of data and looking for a structure in the dataset. In other words, clustering is the classification of the same data; the similarity among the data in a cluster is maximum and the similarity among the data in the different clusters is minimal. The innovation of this paper is a clustering method based on the Crow Search Algorithm (CS...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2001

A Similarity-Based Soft Clustering Algorithm for Documents

نویسندگان

چکیده

منابع مشابه

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

خوشه‌بندی فراابتکاری اسناد فارسی اِکس‌اِم‌اِل مبتنی بر شباهت ساختاری و محتوایی

A word-based soft clustering algorithm for documents

Algorithms for Soft Document Clustering

An improved opposition-based Crow Search Algorithm for Data Clustering

عنوان ژورنال:

اشتراک گذاری